System and method for managing records through establishing semantic coherence of related digital components including the identification of the digital components using templates

ABSTRACT

A method for managing electronic records is provided. Each electronic record includes a data file, a plurality of data files, a portion of a data file, or portions of a plurality of data files. The electronic records include a plurality of record types and data file types. The method includes forming a data file set comprising one or more logically related data files; identifying attributes of each record type in a record type template; identifying specifications of each data file type in a data file type template; and extracting digital components from the data file set. The extracted digital components relate to the attributes in each record type template and the specifications in each data file type template and compose an individual record. An electronic record archive includes record type and data file type templates and a digital component extractor.

CROSS-REFERENCES TO RELATED APPLICATIONS

This application claims the benefit of U.S. Applications 60/802,875, filed May 24, 2006, and 60/797,754, filed May 5, 2006, each of which is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The example embodiments disclosed herein relate to systems and methods for managing records through establishing semantic coherence of related digital components including the identification of the digital components using templates.

BACKGROUND AND SUMMARY OF THE INVENTION

Since the earliest history, various institutions (e.g., governments and private companies alike) have recorded their actions and transactions. Subsequent generations have used these archival records to understand the history of the institution, the national heritage, and the human journey. These records may be essential to support the efficiency of the institution, to protect the rights of individuals and businesses, and/or to ensure that the private company or public corporation/company is accountable to its employees/shareholders and/or that the Government is accountable to its citizens.

With the advance of technology into a dynamic and unpredictable digital era, evidence of the acts and facts of institutions and the government and our national heritage is at risk of being irrecoverably lost. The challenge is pressing: as time moves forward and technologies become obsolete, the risks of loss increase. It will be appreciated that a need has developed in the art to develop an electronic records archives system and method especially, but not only, for the National Archives and Records Administration (NARA) in a system known as Electronic Records Archives (ERA), to resolve this growing problem, in a way that is substantially obsolescence-proof and policy neutral. While embodiments of the invention will be described with respect to its application for safeguarding government records, the described embodiments are not limited to archives systems applications nor to governmental applications and can also be applied to other large scale storage applications, in addition to archives systems, and for businesses, charitable (e.g., non-profit) and other institutions, and entities.

One aspect of the invention is directed to an architecture that will support operational, functional, physical, and interface changes as they occur. In one example, a suite of commercial off-the-shelf (COTS) hardware and software products has been selected to implement and deploy an embodiment of the invention in the ERA, but the inventive architecture is not limited to these products. The architecture facilitates seamless COTS product replacement without negatively impacting the ERA system.

Another aspect of the ERA is to preserve and to provide ready access to authentic electronic records of enduring value.

In one embodiment, the ERA supports and flows from NARA's mission to ensure “for the Citizen and the Public Servant, for the President and the Congress and the Courts, ready access to essential evidence.” This mission facilitates the exchange of vital ideas and information that sustains the United States of America. NARA is responsible to the American people as the custodian of a diverse and expanding array of evidence of America's culture and heritage, of the actions taken by public servants on behalf of American citizens, and of the rights of American citizens. The core of NARA's mission is that this essential evidence must be identified, preserved, and made available for as long as authentic records are needed, regardless of form.

The creation and use of an unprecedented and increasing volume of Federal electronic records, in a wide variety of formats and using evolving technologies, poses a problem that the ERA must solve. An aspect of the invention involves an integrated ERA solution supporting NARA's evolving business processes to identify, preserve, and make available authentic, electronic records of enduring value for as long as they are needed.

In another embodiment, the ERA can be used to store, process, and/or disseminate a private institution's records. That is, in an embodiment, the ERA may store records pertaining to a private institution or association, and/or the ERA may be used by a first entity to store the records of a second entity. System solutions, no matter how elegant, may be integrated with the institutional culture and organizational processes of the users.

Since 1934, NARA has developed effective and innovative processes to manage the records created or received, maintained or used, and destroyed or preserved in the course of public business transacted throughout the Federal Government. NARA played a role in developing this records lifecycle concept and related business processes to ensure long-term preservation of, and access to, authentic archival records. NARA also has been instrumental in developing the archival concept of an authentic record that consists of four fundamental attributes: content, structure, context, and presentation.

NARA has been managing electronic records of archival value since 1968, longer than almost anyone in the world. Despite this long history, the diverse formats and expanding volume of current electronic records pose new challenges and opportunities for NARA as it seeks to identify records of enduring value, preserve these records as vital evidence of our nation's past, and make these records accessible to citizens and public servants in accordance with statutory requirements.

The ERA should support, and may affect, the institution's (e.g., NARA's) evolving business processes. These business processes mirror the records lifecycle and are embodied in the agency's statutory authority:

-   Providing guidance to Federal Agencies regarding records creation and records management;
-   Scheduling records for appropriate disposition;
-   Storing and preserving records of enduring value; and/or
-   Making records available in accordance with statutory and regulatory provisions.

Within this lifecycle framework, the ERA solution provides an integrated and automated capability to manage electronic records from: the identification and capture of records of enduring value; through the storage, preservation, and description of the records; to access control and retrieval functions.

Developing the ERA involves far more than just warehousing data. For example, the archival mission is to identify, preserve, and make available records of enduring value, regardless of form. This three-part archival mission is the core of the Open Archival Information System (OAIS) Reference Model, expressed as ingest, archival storage, and access. Thus, one ERA solution is built around the generic OAIS Reference Model (presented in FIG. 1), which supports these core archival functions through data management, administration, and preservation planning.

The ERA may coordinate with the front-end activities of the creation, use, and maintenance of electronic records by Federal officials. This may be accomplished through the implementation of disposition agreements for electronic records and the development of templates or schemas that define the content, context, structure, and presentation of electronic records along with lifecycle data referring to these records.

The ERA solution may complement NARA's other activities and priorities, e.g., by improving the interaction between NARA staff and their customers (in the areas of scheduling, transfer, accessioning, verification, preservation, review and redaction, and/or ultimately the ease of finding and retrieving electronic records).

Like NARA itself, the scope of ERA includes the management of electronic and non-electronic records, permanent and temporary records, and records transferred from Federal entities as well as those donated by individuals or organizations outside of the government. Each type of record is described and/or defined below.

ERA and Non-Electronic Records: Although the focus of ERA is on preserving and providing access to authentic electronic records of enduring value, the system's scope also includes, for example, management of specific lifecycle activities for non-electronic records. ERA will support a set of lifecycle management processes (such as those used for NARA) for appraisal, scheduling, disposition, transfer, accessioning, and description of both electronic and non-electronic records. A common systems approach to appraisal and scheduling through ERA will improve the efficiency of such tasks for non-electronic records and help ensure that permanent electronic records are identified as early as possible within the records lifecycle. This same common approach will automate aspects of the disposition, transfer, accessioning, and description processes for all types of records, which will result in significant workflow efficiencies. Archivists, researchers, and other users may realize benefits by having descriptions of both electronic and non-electronic records available together in a powerful, universal catalog of holdings. In an embodiment, some of ERA's capabilities regarding non-electronic records may come from subsuming the functionality of legacy systems such as the Archival Research Catalog (ARC). To effectively manage lifecycle data for all types of records, in certain embodiments, ERA also may maintain data interchange with (but not subsume) other legacy systems and likely future systems related to non-electronic records.

Permanent and Temporary Records: There is a fundamental archival distinction between records of enduring historic value, such as those that NARA must retain forever (e.g., permanent records) and those records that a government must retain for a finite period of time to conduct ongoing business, meet statutory and regulatory requirements, or protect rights and interests (e.g., temporary records).

For a particular record series from the US Federal Government, NARA identifies these distinctions during the record appraisal and scheduling processes and they are reflected in NARA-approved disposition agreements and instructions. Specific records are actually categorized as permanent or temporary during the disposition and accessioning processes. NARA takes physical custody of all permanent records and some temporary records, in accordance with approved disposition agreements and instructions. While all temporary records are eventually destroyed, NARA ultimately acquires legal (in addition to physical) custody over all permanent records.

ERA may address the distinction between permanent and temporary records at various stages of the records life-cycle. ERA may facilitate an organization's records appraisal and scheduling processes where archivists and transferring entities may use the system to clearly identify records as either permanent or temporary in connection with the development and approval of disposition agreements and instructions. The ERA may use this disposition information in association with the templates to recognize the distinctions between permanent and temporary records upon ingest and manage these records within the system accordingly.

For permanent records this may involve transformation to persistent formats or use of enhanced preservation techniques to ensure their preservation and accessibility forever. For temporary records, NARA's Records Center Program (RCP) is exploring offering its customers an ERA service to ingest and store long-term temporary records in persistent formats. To the degree that the RCP opts to facilitate their customers' access to the ERA for appropriate preservation of long-term temporary electronic records, this same coordination relationship with transferring entities through the RCP will allow NARA to effectively capture permanent electronic records earlier in the records lifecycle. In the end, ERA may also provide for the ultimate destruction of temporary electronic records.

ERA and Donated Materials: In addition to federal records, NARA also receives and accesses donated archival materials. Such donated collections comprise a significant percentage of NARA's Presidential Library holdings, for example. ERA may manage donated electronic records in accordance with deeds of gift or deposit agreements which, when associated with templates, may ensure that these records are properly preserved and made available to users. Although donated materials may involve unusual disposition instructions or access restrictions, ERA should be flexible enough to adapt to these requirements. Since individuals or institutions donating materials to NARA are likely to be less familiar with ERA than federal transferring entities, the system may also include guidance and tools to help donors and the NARA appraisal staff working with them ensure proper ingest, preservation, and dissemination of donated materials.

Systems are designed to facilitate the work of users, and not the other way around. One or more of the following illustrative classes of users may interact with the ERA: transferring entity; appraiser; records processor; preserver; access reviewer; consumer; administrative user; and/or a manager. The ERA may take into account data security, business process re-engineering, and/or systems development and integration. The ERA solution also may provide easy access to the tools the users need to process and use electronic records holdings efficiently.

NARA must meet challenges relating to archival of massive amounts of information, or the American people risk losing essential evidence that is only available in the form of electronic federal records. But beyond mitigating substantial risks, the ERA affords such opportunities as:

-   Using digital communication tools, such as the Internet, to make electronic records holdings, such as NARA's, available beyond the research room walls in offices, schools, and homes throughout the country and around the world;
-   Allowing users to take advantage of the information-processing efficiencies and capabilities afforded by electronic records;
-   Increasing the return on the public's investment by demonstrating technological solutions to electronic records problems that will be applied throughout our digital society in a wide variety of institutional settings; and/or
-   Developing tools for archivists to perform their functions more efficiently.

According to one aspect of the invention, there is provided a system for ingesting, storing, and/or disseminating information. The system may include an ingest module, a storage module, and a dissemination module that may be accessed by a user via one or more portals.

In an aspect of certain embodiments, there is provided a system and method for automatically identifying, preserving, and disseminating archived materials. The system/method may include extreme scale archive storage architecture with redundancy or at least survivability, suitable for the evolution from terabytes to exabytes, etc.

In another aspect of certain embodiments, there is provided an electronic records archives (ERA), comprising an ingest module to accept a file and/or a record, a storage module to associate the file or record with information and/or instructions for disposition, and an access or dissemination module to allow selected access to the file or record. The ingest module may include structure and/or a program to create a template to capture content, context, structure, and/or presentation of the record or file. The storage module may include structure or a program to preserve authenticity of the file or record over time, and/or to preserve the physical access to the record or file over time. The access module may include structure and/or a program to provide a user with ability to view/render the record or file over time, to control access to restricted records, to redact restricted or classified records, and/or to provide access to an increasing number of users anywhere at any time.

The ingest module may include structure or a program to auto-generate a description of the file or record. Each record may be transformed, e.g., using a framework that wraps and computerizes the record in a self-describing format with appropriate metadata to represent information in the template.

The ingest module may include structure or a program to process a Submission Information Package (SIP) and/or an Archive Information Package (AIP). The access module may include structure or a program to process a Dissemination Information Package (DIP).

Independent aspects of the invention may include the ingest module alone or one or more aspects thereof, the storage module alone or one or more aspects thereof, and/or the access module alone or one or more aspects thereof.

Still further aspects of the invention relate to methods for carrying out one or more functions of the ERA or components thereof (ingest module, storage module, and/or access module).

The challenges faced by NARA are typical of broader archival problems and reveal drawbacks associated with known solutions. Thus, in an embodiment, an ERA may be provided to address some or all of the more general problems. In particular, archives systems exist for storing and preserving electronic assets, which are stored as digital data. Typically, these assets are preserved for a period of time (retention time) and then deleted. These systems maintain metadata about the assets in asset catalogs to facilitate asset management. Such metadata may include one or more of the following:

-   Attributes to uniquely identify assets;
-   Attributes to describe assets;
-   Attributes to facilitate search through the archives;
-   Attributes to define asset structure and relationships to other assets;
-   Attributes to organize assets;
-   Attributes for asset protection;
-   Attributes to maintain information about asset authenticity; and/or
-   Status of the asset lifecycle (e.g., planning receipt of asset through eventual deletion).
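The metadata categories listed above can be pictured as the fields of a single catalog entry. The following Python sketch is illustrative only and is not taken from any described embodiment; the class name `AssetCatalogEntry`, its field names, and the `LifecycleStatus` values are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import List, Optional


class LifecycleStatus(Enum):
    """Hypothetical lifecycle states, from planned receipt to deletion."""
    PLANNED = "planned"
    RECEIVED = "received"
    PRESERVED = "preserved"
    DELETED = "deleted"


@dataclass
class AssetCatalogEntry:
    """One asset catalog entry; each field maps to a metadata category above."""
    asset_id: str                                   # uniquely identifies the asset
    description: str                                # describes the asset
    keywords: List[str] = field(default_factory=list)           # supports search
    related_asset_ids: List[str] = field(default_factory=list)  # structure/relationships
    collection: Optional[str] = None                # organizes assets
    access_restricted: bool = False                 # asset protection
    checksum_sha256: Optional[str] = None           # authenticity information
    status: LifecycleStatus = LifecycleStatus.PLANNED            # lifecycle status
    retention_until: Optional[date] = None          # None means retain indefinitely

    def is_due_for_deletion(self, today: date) -> bool:
        """True once a finite retention time has elapsed."""
        return self.retention_until is not None and today >= self.retention_until
```

An entry with no `retention_until` value is never reported as due for deletion, which corresponds to the indefinite retention case discussed in the surrounding text.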

Unfortunately, these systems all suffer from several drawbacks. For example, there are limitations relating to the scale of the assets managed and, in particular, the size and number of all the assets maintained. These systems also have practical limitations in the duration in which they retain assets. Typically, archives systems are designed to retain data for years or sometimes decades, but not longer. As retention times of assets become very long or indefinite, longevity of the archives system itself, as well as the assets archived, is needed because an archives system's basic requirement is to preserve assets.

But indefinite longevity of an archives system and its assets poses challenges. For example, providing access to old electronic assets is complicated by obsolescence of the asset's format. Regular upgrades of the archives system itself, including migrations of asset data and/or metadata to new storage systems, are complicated by the extreme size of the assets managed. For example, if the metadata has to be redesigned to handle new required attributes or to handle an order of magnitude greater number of assets than supported by the old design, then the old metadata generally will have to be migrated to the new design, which could entail a great deal of migration. Extreme scale and longevity make impractical archives systems that are not designed to accommodate unknown, future changes and reduce the impact of necessary change as much as possible.

Archives systems today are built on top of underlying storage systems based on commercial products that are typically comprised of file systems (e.g., Sun's ZFS file system) or relational databases (e.g., Oracle), and sometimes proprietary systems (e.g., EMC Centera). All of these storage systems have limitations in terms of scale (though sometimes the limits can be quite high). In some cases, there may be no products that can make use of the full scale of available file systems. Few of these systems can scale to trillions of entries (e.g., files). Limitations arise for different reasons but can be related to one or more of the following factors, alone or in combination:

-   Limitations of object or file identification schemes (e.g., uniqueness of identifiers. www.doi.org provides background on the state of the art for electronic/digital entity identifiers.);
-   Catalog limitations (e.g., number of entries, design bottlenecks);
-   The number of storage subsystems that can be integrated (sometimes termed horizontal scalability);
-   The capacity of underlying storage technologies;
-   Search and retrieval performance considerations (e.g., search can become impractical with extreme size);
-   The ability to distribute system components (e.g., systems can be difficult to distribute geographically); and/or
-   Limitations of system maintenance tasks that are a function of system size (e.g., systems can become impractical to administer with extreme size).

Currently, relational databases (DBs) can scale only to 10 billion objects per instance. Relational DBs also generally do not perform as well as file systems for simple search and retrieval function tasks because they tend to introduce additional overhead to meet other requirements such as fine-grained transactional integrity. There is also no viable product that integrates multiple file systems in a way that provides both extreme scaling and longevity suitable for an archives file system.

There clearly exists a need for a system and/or method for managing records that allows for identifying and managing the records that is not dependent on the original hardware and/or software used to create the records, which may have little or no records management function.

According to one embodiment of the present invention, a method is provided for managing electronic records. Each electronic record comprises a data file, a plurality of data files, a portion of a data file, or portions of a plurality of data files. The electronic records comprise a plurality of record types and data file types. The method comprises forming a data file set comprising one or more logically related data files; identifying attributes of each record type in a record type template; identifying specifications of each data file type in a data file type template; and extracting digital components from the data file set, wherein the extracted digital components relate to the attributes in each record type template and the specifications in each data file type template and comprise an individual record.
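As a rough illustration of the method steps just described, the following Python sketch forms a small data file set, describes one record type and one data file type with templates, and extracts the digital components that compose each record. It is a minimal sketch under stated assumptions, not the disclosed extractor: the delimited-text file type, the `key_attribute` used to link components across files, and all class and function names are hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DataFileTypeTemplate:
    """Hypothetical specifications of a data file type (here: delimited text)."""
    extension: str
    delimiter: str


@dataclass(frozen=True)
class RecordTypeTemplate:
    """Hypothetical attributes of a record type."""
    record_type: str
    attributes: Tuple[str, ...]
    key_attribute: str  # assumed attribute that links components of one record


def extract_records(
    data_file_set: Dict[str, str],
    file_template: DataFileTypeTemplate,
    record_template: RecordTypeTemplate,
) -> Dict[str, Dict[str, str]]:
    """Extract digital components from every matching file and group them
    by key, so that portions of several files compose one record."""
    records: Dict[str, Dict[str, str]] = {}
    for name, text in data_file_set.items():
        if not name.endswith(file_template.extension):
            continue  # file does not match this data file type template
        reader = csv.DictReader(io.StringIO(text), delimiter=file_template.delimiter)
        for row in reader:
            key = row.get(record_template.key_attribute)
            if key is None:
                continue  # row cannot be tied to any record
            component = {a: row[a] for a in record_template.attributes if a in row}
            records.setdefault(key, {}).update(component)
    return records
```

In this sketch a record is assembled from portions of a plurality of data files: a row in one file may supply a name, a row in another file a filing date, and both are merged under the shared key.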

According to another embodiment of the present invention, an electronic record archive for managing electronic records is provided. Each electronic record comprises a data file, a plurality of data files, a portion of a data file, or portions of a plurality of data files. The electronic records comprise a plurality of record types and data file types. The electronic record archive comprises a data file set comprising one or more logically related data files; a record type template for each record type, each record type template identifying attributes of each record type; a data file type template for each data file type, each data file type template identifying specifications of each data file type; and a digital component extractor configured to extract digital components from the data file set. The extracted digital components relate to the attributes in each record type template and the specifications in each data file type template and comprise an individual record.

It will be appreciated that the above-described embodiments, and the elements thereof, may be used alone or in various combinations to realize yet further embodiments.

Other aspects, features, and advantages of this invention will become apparent from the following detailed description when taken in conjunction with the accompanying drawings, which are a part of this disclosure and which illustrate, by way of example, principles of this invention.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a reference model of an overall archives system;

FIG. 2 is a chart demonstrating challenges and solutions related to certain illustrative aspects of the present invention;

FIG. 3 illustrates the notional life cycle of records as they move through the ERA system, in accordance with an example embodiment;

FIG. 4 illustrates the ERA System Functional Architecture from a notional perspective, delineating the system-level packages and external system entities, in accordance with an example embodiment;

FIG. 5 illustrates a digital component extractor model according to the present invention;

FIG. 6 illustrates an XML Schema as a template for content and structure of a record;

FIG. 7 illustrates an instance of the template of FIG. 6; and

FIG. 8 illustrates an XSL template for defining the presentation of the instance of FIG. 7.

DETAILED DESCRIPTION

The following description includes several examples and/or embodiments of computer-driven systems and/or methods for carrying out automated information storage, processing and/or access. In particular, the examples and embodiments are focused on systems and/or methods oriented specifically for use with the U.S. National Archives and Records Administration (NARA). However, it will be recognized that, while one or more portions of the present specification may be limited in application to NARA's specific requirements, most if not all of the described systems and/or methods have broader application. For example, the implementations described for storage, processing, and/or access to information (also sometimes referred to as ingest, storage, and dissemination) can also apply to any institution that requires and/or desires automated archiving and/or preservation of its information, e.g., documents, email, corporate IP/knowledge, etc. The term “institution” includes at least government agencies or entities, private companies, publicly traded corporations, universities and colleges, charitable or non-profit organizations, etc. Moreover, the term “electronic records archive” (ERA) is intended to encompass a storage, processing, and/or access archives for any institution, regardless of nature or size.

As one example, NARA's continuing fulfillment of its mission in the area of electronic records presents new challenges and opportunities, and the embodiments described herein that relate to the ERA and/or asset catalog may help NARA fulfill its broadly defined mission. The underlying risk associated with failing to meet these challenges or realizing these opportunities is the loss of evidence that is essential to sustaining a government's or an institution's needs. FIG. 2 relates specific electronic records challenges to the components of the OAIS Reference Model (ingest, archival storage, access, and data management/administration), and summarizes selected relevant research areas.

Ingest: The ERA needs to identify and capture all components of the record that are necessary for effective storage and dissemination (e.g., content, context, structure, and presentation). This can be especially challenging for records with dynamic content (e.g., websites or databases).

Archival Storage: Recognizing that in the electronic realm the logical record is independent of its media, the four illustrative attributes of the record (e.g., content, context, structure, and presentation) and their associated metadata still must be preserved “for the life of the Republic.”

Access: NARA will not fulfill its mission simply by storing electronic records of archival value. Through the ERA, these records will be used by researchers long after the associated application software, operating system, and hardware all have become obsolete. The ERA also may apply and enforce access restrictions to sensitive information while at the same time ensuring that the public interest is served by consistently removing access restrictions that are no longer required by statute or regulation.

Data Management: The amount of data that needs to be managed in the ERA can be monumental, especially in the context of government agencies like NARA. Presented herewith are embodiments that are truly scalable solutions that can address a range of needs, from a small focused Instance through large Instances. In such embodiments, the system can be scaled easily so that capacity in both storage and processing power is added when required, and not so soon that large excess capacities exist. This will allow for the system to be scaled to meet demand and provide for maximum flexibility in cost and performance to the institution (e.g., NARA).

Satisfactorily maintaining authenticity through technology-based transformation and re-representation of records is extremely challenging over time. While there has been significant research about migration of electronic records and the use of persistent formats, there has been no previous attempt to create an ERA solution on the scale required by some institutions such as NARA.

Migrations are potentially lossy transformations, so techniques are needed to detect and measure any actual loss. The system may reduce the likelihood of such loss by applying statistical sampling, based on human judgment for example, backed up with appropriate software tools, and/or institutionalized in a semi-automatic monitoring process.
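One way to picture the statistical sampling mentioned above is to draw a random sample of migrated records and compare each against its original. The Python sketch below is an assumption-laden illustration rather than the disclosed monitoring process: the function name `estimate_loss_rate`, the caller-supplied `is_equivalent` comparison (which stands in for human judgment or a software checking tool), and the in-memory dictionaries of record content are all hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple


def estimate_loss_rate(
    originals: Dict[str, str],
    migrated: Dict[str, str],
    sample_size: int,
    is_equivalent: Callable[[str, str], bool],
    seed: int = 0,
) -> Tuple[float, List[str]]:
    """Sample record ids at random and report the fraction whose migrated
    form is missing or judged not equivalent to the original."""
    rng = random.Random(seed)           # seeded so a review can be repeated
    ids = sorted(originals)
    sample = rng.sample(ids, min(sample_size, len(ids)))
    failures = [
        record_id
        for record_id in sample
        if record_id not in migrated
        or not is_equivalent(originals[record_id], migrated[record_id])
    ]
    rate = len(failures) / len(sample) if sample else 0.0
    return rate, sorted(failures)
```

The returned list of failing record ids could feed a semi-automatic review queue, with the observed rate used to decide whether a migration batch needs closer inspection.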

Table 1 summarizes the “lessons learned” by the Applicants from experience with migrating different types of records to a Persistent Object Format (POF).

TABLE 1

Type of record | Current Migration Possibilities

E-mail | The Dutch Testbed project has shown that e-mail can be successfully migrated to a POF. An XML-based POF was designed by Tessella as part of this work. Because e-mail messages can contain attached files in any format, an e-mail record should be preserved as a series of linked objects: the core message, including header information and message text, and related objects representing attachments. These record relationships are stored in the Record Catalog. Thus, an appropriate preservation strategy can be chosen and applied to each file, according to its type.

Word processing documents | Simple documents can be migrated to a POF, although document appearance can be complex and may include record characteristics. Some documents can also include other embedded documents which, like e-mail attachments, can be in any format. Documents can also contain macros that affect “behavior” and are very difficult to deal with generically. Thus, complex documents currently require an enhanced preservation strategy. Adobe's Portable Document Format (PDF) often has been treated as a suitable POF for Word documents, as it preserves presentation information and content. The PDF specification is controlled by Adobe, but it is published, and PDF readers are widely available, both from Adobe and from third-parties. ISO are currently developing, with assistance from NARA, a standard version of PDF specifically designed for archival purposes (PDF/A). This format has the benefit that it forces some ambiguities in the original to be removed. However, both Adobe and Microsoft are evolving towards using native XML for their document formats.

Images | TIFF is a widely accepted open standard format for raster images and is a good candidate in the short to medium term for a POF. For vector images, the XML-based Scalable Vector Graphics format is an attractive option, particularly as it is a W3C open standard.

Databases | The contents of a database should be converted to a POF rather than being maintained in the vendor's proprietary format. Migration of the contents of relational database tables to an XML or flat file format is relatively straightforward. However, in some cases, it is also desirable to represent and/or preserve the structure of the database. In the Dutch Digital Preservation Testbed project, this was achieved using a separate XML document to define the data types of columns, constraints (e.g., whether the data values in a column must be unique), and foreign key relationships, which define the inter-relationships between tables. The Swiss Federal Archives took a similar approach with their SIARD tool, but used SQL statements to define the database structure. Major database software vendors have taken different approaches to implementing the SQL “standard” and add extra non-standard features of their own. This complicates the conversion to a POF. Another difficulty is the Binary Large Object (BLOB) datatype, which presents similar problems to those of e-mail attachments: any type of data can be stored in a BLOB and in many document-oriented databases, the majority of the important or relevant data may be in this form. In this case, separate preservation strategies may be applied according to the type of data held. A further challenge with database preservation is that of preserving not only the data, but the way that the users created and viewed the data. In some cases this may depend on stored queries and stored procedures forming the database; in others it may depend on external applications interacting with the database. To preserve such “executable” aspects of the database “as a system” is an area of ongoing research.

Records with a high degree of “behavioral” properties (e.g., virtual reality models) | For this type of record, it is difficult to separate the content from the application in which it was designed to operate. This makes these records time-consuming to migrate to any format. Emulation is one approach, but this approach is yet to be fully tested in an archival environment. Migration to a POF is another approach, and more research is required into developing templates to support this.

Spreadsheets | The Dutch Testbed project examined the preservation of spreadsheets and concluded that an XML-based POF was the best solution, though did not design the POF in detail. The structured nature of spreadsheet data means that it can be mapped reliably and effectively to an XML format. This approach can account for cell contents, the majority of appearance related issues (cell formatting, etc), and formulae used to calculate the contents of some cells. The Testbed project did not address how to deal with macros: most spreadsheet software products include a scripting or programming language to allow very complex macros to be developed (e.g., Visual Basic for Applications as part of Microsoft Excel). This allows a spreadsheet file to contain a complex software application in addition to the data it holds. This is an area where further research is necessary, though it probably applies to only a small proportion of archival material.

Web sites | Most Web sites include documents in standardized formats (e.g., HTML). However, it should be noted that there are a number of types of HTML documents, and many Web pages will include incorrectly formed HTML that nonetheless will be correctly displayed by current browsers. The structural relationship between the different files in a web-site should be maintained. The fact that most web-sites include external as well as internal links should be managed in designing a POF for web-sites. The boundary of the domain to be archived should be defined and an approach decided on for how to deal with links to files outside of that domain. Many modern web sites are actually applications where the navigation and formatting are generated dynamically from executed pages (e.g., Active Server Pages or Java Server Pages). The actual content, including the user's preferences on what content is to be presented, is managed in a database. In this case, there are no simple web pages to archive, as different users may be presented with different material at different times. This situation overlaps with our discussion above of databases and the applications which interact with them.

Sound and video | For audio streams, the WAV and AVI formats are the de facto standards and therefore a likely basis for POFs. For video, there are a number of MPEG formats in general use, with varying degrees of compression. While it is desirable that only lossless compression techniques are used for archiving, if a lossy compression was used in the original format it cannot be recaptured in a POF. For video archives in particular, there is the potential for extremely large quantities of material. High quality uncompressed video streams can consume up to 100 GB per hour of video, so storage space is an issue for this record type.

It is currently not possible to migrate a number of file formats in a way that will be acceptable for archival purposes. One aspect is to encourage the evolution and enhancement of third-party migration software products by providing a framework into which such commercial off-the-shelf (COTS) software products could become part of the ERA if they meet appropriate tests.

When an appropriate POF cannot be identified to reduce the chances of obsolescence, the format may need to be migrated to a non-permanent but more modern, proprietary format (this is known as Enhanced Preservation). Even POFs are not static, since they still need executable software to interpret them, and future POFs may need to be created that have less feature loss than an older format. Thus, the ERA may allow migrated files to be migrated again into a new and more robust format in the future. Through the Dutch Testbed Project, the Applicants have found that it is normally better to return to the original file(s) whenever such a re-migration occurs. Thus, when updating a record, certain example embodiments may revert to an original version of the document and migrate it to a POF accordingly, whereas certain other example embodiments may not be able to migrate the original document (e.g., because it is unavailable, in an unsupported format, etc.) and thus may be able to instead or in addition migrate the already-migrated file. Thus, in certain example embodiments, a new version of a record may be derived from an original version of the record if it is available or, if the original is not available, the new version may be derived from any other already existing derivative version (e.g., of the original). As such, an extensible POF for certain example embodiments may be provided.
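By way of illustration only and without limitation, the source-selection rule described above (prefer the original version, otherwise fall back to an existing derivative) might be sketched as follows. The `RecordVersion` structure, its field names, and the function name are hypothetical and are not taken from the ERA design.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RecordVersion:
    """One stored version of a record (hypothetical structure)."""
    version_id: str
    is_original: bool
    available: bool  # False if the file is lost or in an unsupported format


def choose_migration_source(versions: Sequence[RecordVersion]) -> Optional[RecordVersion]:
    """Pick the version from which a new POF version should be derived."""
    # First preference: an original version that can still be migrated.
    for version in versions:
        if version.is_original and version.available:
            return version
    # Fallback: any already existing derivative version that is available.
    for version in versions:
        if not version.is_original and version.available:
            return version
    # Nothing usable; the caller decides how to handle this case.
    return None
```

In this sketch the first available derivative is taken as the fallback; an embodiment could equally rank derivatives, for example by feature loss, before choosing one.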

In view of the above aspects of the OAIS Reference Model, the ERA may comprise an ingest module to accept a file and/or a record, a storage module to associate the file or record with information and/or instructions for disposition, and an access or dissemination module to allow selected access to the file or record. The ingest module may include structure and/or a program to create a template to capture content, context, structure, and/or presentation of the record or file. The storage module may include structure and/or a program to preserve authenticity of the file or record over time, and/or to preserve the physical access to the record or file over time. The access module may include structure or a program to provide a user with ability to view/render the record or file over time, to control access to restricted records, to redact restricted or classified records, and/or to provide access to an increasing number of users anywhere at any time.

FIG. 3 illustrates the notional life cycle of records as they move through the ERA system, in accordance with an example embodiment. Records flow from producers, who are persons or client systems that provide the information to be preserved, and end up with consumers, who are persons or client systems that interact with the ERA to find preserved information of interest and to access that information in detail. The Producer also may be a “Transferring Entity.”

During the “Identify” stage, producers and archivists develop a Disposition Agreement to cover records. This Disposition Agreement contains disposition instructions, and also a related Preservation and Service Plan. Producers submit records to the ERA System in a SIP. The transfer occurs under a pre-defined Disposition Agreement and Transfer Agreement. The ERA System validates the transferred SIP by scanning for viruses, ensuring the security access restrictions are appropriate, and checking the records against templates. The ERA System informs the Producer of any potential problems, and extracts metadata (including descriptive data, described in greater detail below), creates an Archival Information Package (or AIP, also described in greater detail below), and places the AIP into Archival Storage. At any time after the AIP has been placed into Archival Storage, archivists may perform Archival Processing, which includes developing arrangement, description, finding aids, and other metadata. These tasks will be assigned to archivists based on relevant policies, business rules, and management discretion. Archival processing supplements the Preservation Description Information metadata in the archives.

At any time after the AIP has been placed into Archival Storage, archivists may perform Preservation Processing, which includes transforming the records to authentically preserve them. Policies, business rules, Preservation and Service Plans, and management discretion will drive these tasks. Preservation processing supplements the Preservation Description Information metadata in the archives, and produces new (transformed) record versions.

With respect to the “Make Available” phase, at any time after the AIP has been placed into Archival Storage, archivists may perform Access Review and Redaction, which includes performing mediated searches, verifying the classification of records, and coordinating redaction of records where necessary. These tasks will be driven by policies, business rules, and access requests. Access Review and Redaction supplement the Preservation Description Information metadata in the archives, and produce new (redacted) record versions. Also, at any time after the AIP has been placed into Archival Storage, Consumers may search the archives to find records of interest.

FIG. 4 illustrates the ERA System Functional Architecture from a notional perspective, delineating the system-level packages and external system entities, in accordance with an example embodiment. The rectangular boxes within the ERA System boundary represent the six system-level packages. The ingest system-level package includes the means and mechanisms to receive the electronic records from the transferring entities and prepares those electronic records for storage within the ERA System, while the records management system-level package includes the services necessary to manage the archival properties and attributes of the electronic records and other assets within the ERA System as well as providing the ability to create and manage new versions of those assets. Records Management includes the management functionality for disposition agreements, disposition instructions, appraisal, transfer agreements, templates, authority sources, records life cycle data, descriptions, and arrangements. In addition, access review, redaction, and selected archival management tasks for non-electronic records, such as the scheduling and appraisal functions, are also included within the Records Management service.

The Preservation system-level package includes the services necessary to manage the preservation of the electronic records to ensure their continued existence, accessibility, and authenticity over time. The Preservation system-level service also provides the management functionality for preservation assessments, Preservation and Service Level plans, authenticity assessment and digital adaptation of electronic records. The Archival Storage system-level package includes the functionality to abstract the details of mass storage from the rest of the system. This abstraction allows this service to be appropriately scaled as well as allow new technology to be introduced independent of the other system-level services according to business requirements. The Dissemination system-level package includes the functionality to manage search and access requests for assets within the ERA System. Users have the capability to generate search criteria, execute searches, view search results, and select assets for output or presentation. The architecture provides a framework to enable the use of multiple search engines offering a rich choice of searching capabilities across assets and their contents.

The Local Services and Control (LS&C) system-level package includes the functional infrastructure for the ERA Instance including a user interface portal, user workflow, security services, external interfaces to the archiving entity and other entities' systems, as well as the interfaces between ERA Instances. All external interfaces are depicted as flowing through LS&C, although the present invention is not so limited.

The ERA System contains a centralized monitoring and management capability called ERA Management. The ERA Management hardware and/or software may be located at an ERA site. The Systems Operations Center (SOC) provides the system and security administrators with access to the ERA management Virtual Local Area Network. Each SOC manages one or more Federations of Instances based on the classification of the information contained in the Federation.

Also shown are the three primary data stores for each Instance:

1. Ingest Working Storage: Contains transfers that remain until they are verified and placed into the Electronic Archives;
2. Electronic Archives: Contains all assets (e.g., disposition agreements, records, templates, descriptions, authority sources, arrangements, etc.); and
3. Instance Data Storage: Contains a performance cache of all business assets, operational data and the ERA asset catalog.

This diagram provides a representative illustration of how a federated ERA system can be put together, though it will be appreciated that the same is given by way of example and without limitation. Also, the diagram describes a collection of Instances at the same security classification level and compartment that can communicate electronically via a WAN with one another, although the present invention is not so limited. For example, FIG. 5 is a federation of ERA instances, in accordance with an example embodiment. The federation approach is described in greater detail below, although it is important to note here that the ERA and/or the asset catalog may be structured to work with and/or enable a federated approach.

The ERA's components may be structured to receive, manage, and process a large amount of assets and collections of assets. Because of the large amount of assets and collections of assets, it would be advantageous to provide an approach that scales to accommodate the same. Beyond the storage of the assets themselves, a way of understanding, accessing, and managing the assets may be provided to add meaning and functionality to the broader ERA. To serve these and/or other ends, an asset catalog including related, enabling features may be provided.

In particular, to address the overall problems of scaling and longevity, the asset catalog and storage system federator may address the following underlying problems, alone or in various combinations:

-   Capturing business objects that relate to assets that are particular to the application storing the assets (e.g., in an archiving system, such business objects may include, for example, disposition and destruction information, receipt information, legal transfer information, appraisals and archive description, etc.), with each new business use of the design potentially defining unique business objects that are needed to control its assets and execute its business processes;
-   Maintaining arbitrary asset attributes to be flexible in accommodating unknown future attributes;
-   Employing asset and other identifiers that are immutable so that they remain useful indefinitely and, therefore, enable them to be referenced both within the archives and by external entities with a reduced concern for changes over time;
-   Supporting search and navigation through the extreme scale and diversity of assets archived;
-   Handling obsolescence of assets that develops over time;
-   Accommodating redacted and other derivative versions of assets appropriate for an archive system;
-   Federating (e.g., integrate independent parts to create a larger whole) multiple, potentially heterogeneous, distributed, and independent archives systems (e.g., instances) to provide a larger scale archive system;
-   Supporting a distributed implementation necessary for scaling, site independence, and disaster recovery considerations where the distribution of assets and associated catalogs may change over time but remain visible to all sites;
-   Employing a search architecture and catalog format that allows exploitation of multiple, possibly commercial search engines for differing asset data types and across instances of archives in a federation, as future needs may dictate;
-   Accommodating multiple, heterogeneous, commercial storage subsystems among and within the instances in a federation of archives to achieve extreme scaling and adapt to changes over time;
-   Supporting a variety of data handling requirements based on, for example, security level, handling restrictions and ownership, in a manner that performs well and remains manageable for an extremely large number of assets and catalog entries;
-   Supporting storage of any kind of electronic asset;
-   Supporting transparent data location and migration and storage subsystem upgrades/changes; and/or
-   Supporting reconstruction of the catalog and archives with little or no information other than the original catalog and archived bit streams (e.g., for the purposes of disaster recovery).

Electronic records are manifested, in some way, as electronic data files. There are several requirements for managing the relationship between electronic records and data files. These requirements include, but are not limited to: 1) ensuring that all data files stored in the system are associated with the records they constitute; 2) specifying the relationship of each ingested data file with an electronic record; 3) specifying the relationship of each transformed data file to an electronic record; and 4) verifying the data files associated with electronic records contained in a transfer.
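As a non-limiting sketch, requirements 1) and 4) can be read as two set comparisons between the data files actually present and the data files each record claims. The function names, the record identifiers, and the dictionary-of-sets representation below are assumptions made for illustration and do not describe the ERA implementation.

```python
from typing import Dict, Iterable, List, Set


def unassociated_data_files(
    stored_files: Iterable[str],
    record_to_files: Dict[str, Set[str]],
) -> List[str]:
    """Requirement 1: list stored data files that no record claims."""
    associated: Set[str] = set()
    for files in record_to_files.values():
        associated |= files
    return sorted(set(stored_files) - associated)


def missing_data_files(
    stored_files: Iterable[str],
    record_to_files: Dict[str, Set[str]],
) -> Dict[str, List[str]]:
    """Requirement 4: per record, list claimed data files that are absent."""
    present = set(stored_files)
    return {
        record_id: sorted(files - present)
        for record_id, files in record_to_files.items()
        if files - present
    }
```

A transfer would pass both checks when the first function returns an empty list and the second returns an empty dictionary.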

The relationship between electronic records and data files appears simple at first glance, but is in reality somewhat complex, particularly when considering the relationship between an individual electronic record and data files, as is required by requirements 2) and 3) above. Although it is tempting to think of electronic records as being directly composed of data files, this is incorrect, as explained in more detail below.

The present invention solves this complexity through an intermediate layer called a digital component extractor, which establishes a bridge between electronic records and data files. This bridge allows archivists and transferring entities to model the true semantic relationship between individual electronic records and data files.

The concept of a record originates in the archival and records management domains, where a record represents a “unit of recorded information”. As used herein, the term “record” means a unit of recorded information created, received, and maintained as evidence or information by an organization or person, in pursuance of legal obligations or the transaction of business.

This definition has a conceptual basis, in the sense that records are recognized and understood by humans to represent information. It is necessary when discussing electronic records to distinguish the archival and records management term “record” from the computer science concept of the same name. The computer science concept of “record” formally represents a matrix-tuple in linear algebra which is analogous to a row in a database table. The present invention uses the unqualified term “record” to indicate the archival and records management concept, and uses the qualifier “tuple record” to indicate the computer science concept. As used herein, the term “tuple record” means a matrix-tuple (defined by linear algebra), which is a finite function that maps field names to a certain value.
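For illustration only, a “tuple record” in the sense defined above can be modeled as a read-only mapping from field names to values, much like one row of a database table. The helper name and the sample field names and values below are invented for this sketch.

```python
from types import MappingProxyType
from typing import Any, Mapping


def make_tuple_record(**fields: Any) -> Mapping[str, Any]:
    """Build a tuple record: a finite, read-only map of field name -> value."""
    return MappingProxyType(dict(fields))


# Hypothetical row; the field names and values are illustrative only.
row = make_tuple_record(name="J. Smith", rank="Sergeant", service_start="1968-03-01")

# Looking up a field applies the finite function to a field name.
rank = row["rank"]
```

The read-only wrapper is a design choice of the sketch: it keeps the mapping fixed once built, which matches the idea of a function with a finite, known set of field names.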

Archivists and records managers typically manage numerous records. The requirements discussed above require the system to manage not only records (in the plural), but also individual records (in the singular). The requirement to manage both individual and plural records presents several questions, including, but not limited to: 1) what defines the exact extent of an individual record? and 2) where precisely does an individual record start and where precisely does it end?

The answers to these questions must be precisely specified in the context of electronic records, where individual electronic records are managed independently.

Given the conceptual nature of records, a conceptual approach to defining the exact extent of a particular individual record is needed. A record can be said to exhibit a characteristic known as strong “semantic coherence,” which is implied by the “unit of recorded information” phrase in the definition of a record. As used herein, the term “semantic coherence” is defined as a conceptual meaning that is closely related through connections and consistency, and holds together firmly as parts of the same mass.

Semantic coherence covers a scale, from weak (no coherence) to strong (high coherence), and the exact point on the scale for any particular set of information will involve subjective (archival) judgment. A record represents conceptual meaning that “sticks together” strongly enough on the semantic coherence scale to be considered an individual record.

Consider the following examples of semantic coherence:

EXAMPLE 1

Consider a record of a particular veteran's military service. Information about that individual's service dates, ranks, and defined benefits is strongly logically connected. Is the same information for a different individual the same record? No, because the logical connection for information about one particular individual is very strong whereas the logical connection for information across individuals is weaker.

EXAMPLE 2

Consider again a record of a veteran's military service. Now consider information about a battle plan for a particular military engagement in which the individual participated. Is the battle plan part of the individual's military service record? No, while the battle plan is in itself a record (and is loosely connected to the individual's service record), its meaning is inconsistent with the service record, and is therefore a separate record.

Put another way, strong semantic coherence is the characteristic that allows a distinction between one particular record and another particular record.

With paper records, archivists often do not identify individual records, due to time and resource constraints. Instead, archivists typically manage records in the aggregate. With electronic records, archivists may have the capability and desire to identify individual electronic records as standard practice.

Each individual record has an attribute that defines its particular “record type.” As used herein, the term “record type” refers to the abstract form of the records, such as letter, memo, greeting card, or portrait, etc. As such, each record type represents a distinctive class of electronic records defined by their form, function, or use. Consider the following example of record types:

EXAMPLE 3

A parish church will typically maintain many different types of electronic records, including baptismal records, deeds to parish properties, ledgers of the parish financial accounts, minutes of parish meetings, and official parish correspondence. Each of these different record types has a distinct intellectual form. For example, baptismal records almost always list at least the name of the person baptized, the date and place of birth, and the date and place of the baptism. In contrast, financial account ledger records might include a chart of accounts with debit/credit entries. It would be rather surprising to find an infant's birth date in a financial ledger.

The abstract form of a record type is specified by a “record type template.” As used herein, a “record type template” is a template that identifies specific attributes for a specific type of record. The record type template specifies the essential characteristics of the record, which are used to ensure authenticity.

Referring again to Example 3, the record type template for baptismal records would identify the information expected in that type of record, such as the name of the person baptized, date and place of birth, etc. FIG. 5 illustrates the relationship between a record and a record type template. A record type template specifies the form of a record.
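By way of example and without limitation, a record type template for the baptismal records of Example 3 might be sketched as a named set of expected attributes against which a candidate record is checked. The class name, the attribute identifiers, and the use of a plain mapping to stand in for a record are all assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Mapping


@dataclass(frozen=True)
class RecordTypeTemplate:
    """Identifies the attributes expected for one record type (hypothetical)."""
    record_type: str
    required_attributes: FrozenSet[str]

    def missing_attributes(self, record: Mapping[str, object]) -> List[str]:
        """Return the expected attributes that the record does not supply."""
        return sorted(a for a in self.required_attributes if a not in record)


# Attribute names paraphrase Example 3; the identifiers are illustrative.
BAPTISMAL_TEMPLATE = RecordTypeTemplate(
    record_type="baptismal record",
    required_attributes=frozenset(
        {"name", "birth_date", "birth_place", "baptism_date", "baptism_place"}
    ),
)
```

A record for which `missing_attributes` returns an empty list supplies everything the template expects; a non-empty result names what an archivist would need to resolve.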

The Record Type Template also specifies the essential characteristics of the record, which are used to ensure authenticity as documented in co-pending, commonly assigned U.S. Application (Attorney Docket No. 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.

Electronic records are accumulated and organized into “record aggregates” to facilitate organization and archival processing. As used herein, the term “record aggregate” means an intellectual aggregation of documentary material arising because they result from the same accumulation or filing process, the same function, or the same activity; have a particular form; or because of some other relationship arising out of their creation, receipt, or use; or because the aggregate was required for the purposes of archival arrangement. Record aggregates may be composed of other record aggregates, or records.

Record aggregates can themselves be accumulated and organized into higher order record aggregates. Consider the following example of record aggregates:

EXAMPLE 4

An archivist might place military service records into an aggregate for the branch of the military (e.g., Army) which itself is within an aggregate for the Department of Defense, which itself is within an aggregate for the Federal Government.

Record aggregates may follow standard levels: record groups, collections, series, file units, and items. Each record aggregate has name and title attributes which help identify it. Record aggregates may be composed of other record aggregates, or electronic records. FIG. 5 illustrates the relationship between electronic records and record aggregates.

Record aggregates may either be homogeneous, i.e., they contain electronic records of the same record type, or heterogeneous, i.e., they contain electronic records of different record types.
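A minimal sketch of the nesting and of the homogeneous/heterogeneous distinction follows, given by way of example only. The class and method names are hypothetical, and the sketch assumes that an aggregate's record types are collected from all nested aggregates as well as from its direct members.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class Record:
    title: str
    record_type: str


@dataclass
class RecordAggregate:
    name: str
    # Members may be records or other (lower order) record aggregates.
    members: List[Union["RecordAggregate", Record]] = field(default_factory=list)

    def record_types(self) -> Set[str]:
        """Collect the record types found anywhere beneath this aggregate."""
        types: Set[str] = set()
        for member in self.members:
            if isinstance(member, Record):
                types.add(member.record_type)
            else:
                types |= member.record_types()
        return types

    def is_homogeneous(self) -> bool:
        """True when at most one record type occurs (empty counts as homogeneous)."""
        return len(self.record_types()) <= 1
```

Following Example 4, an Army aggregate holding only service records would be homogeneous, while a Department of Defense aggregate holding that Army aggregate together with a battle plan would be heterogeneous.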

Like electronic records, record aggregates have a degree of semantic coherence: they are organized according to principles of original order and provenance, which ensures that related electronic records are aggregated together. However, the semantic coherence that binds together a record aggregate is somewhat weaker than the semantic coherence that binds together a particular individual record. Put another way, an individual record within an aggregate has an independent identity because its semantic coherence is “strong enough” to be considered a record.

Computer software applications operate on data files, and data files represent the atomic unit of recorded information for computers. Where electronic records are conceptual in nature, data files are clearly physical. As used herein, the term “data file” means: 1) a collection of data that is stored together and treated as a unit by a computer software application; and 2) related data (e.g., numeric, textual, and/or graphic information) and fields that are organized in a strictly prescribed form and format. This definition includes two characteristics of data files, which are described in more detail below.

The first characteristic is that data files typically require interpretation by a computer software application, which the OAIS model calls “access software.” The OAIS definition for “access software” is a type of software that presents part of or all of the information content of an Information Object in forms understandable to humans or systems.

While it is conceivable that a person might look at all the individual bits of a data file to try to make sense of it, people generally use access software to present the information in some usable manner. The access software performs some kind of “presentation processing” to accomplish this. “Presentation processing” is defined as the software processing algorithms (including transformation, consolidation, tabulation, formatting, rendering, querying, filtering, interpretation, etc.) which access software employs to present the information contained in data files in a form understandable to humans.

Presentation processing covers a scale, from low (little to no processing required) to high (complex processing required), and the exact point on the scale for any particular set of information will involve subjective judgment. Presentation processing often involves presenting data files visually, but could also include presenting data files audibly or through any other human sensory perception.

Some data files are “eye readable” with minimal presentation processing. “Eye readable” is defined as data files whose information is inherently understandable to humans through visual inspection using access software that supports minimal presentation processing.

Only the simplest of data files are eye readable and most data files are completely unintelligible without a high degree of presentation processing. Using access software specifically suited to presenting a certain class of data files is necessary when the access software performs a high degree of software processing because without this access software, the information in the data files would be incomprehensible. Consider the following examples:

EXAMPLE 5

A fixed-length tabular dataset might be composed of one data file that structures tabular data into a regular row/column format that can easily be read and understood by a person. In this case, using access software might be optional.

EXAMPLE 6

A single web page might be composed of dozens of individual data files. For example, the web page might include multiple Hyper-Text Markup Language (HTML) data files, multiple Cascading Style Sheet (CSS) data files, client-side JavaScript script files, and multiple image files in various formats, such as Graphics Interchange Format (GIF) and Portable Network Graphics (PNG).

While a person could look through the individual bytes in each of these individual files, doing so would not provide an accurate sense of the data files' information content. This is because the access software, a web browser, actually performs a great deal of software processing to apply style sheets to transform and render content, more software processing to render images, and more software processing to render the behavior contained in the client-side scripts. This kind of software processing cannot easily be imagined or replicated by a person, so using access software is required.

EXAMPLE 7

Many data file formats are either undocumented, or are essentially incomprehensible to a person. For example, Microsoft Word's native binary (DOC) data file format is incompletely documented (due to the fact that it is proprietary) and is incomprehensible to a person who might look at the individual bytes within the data file. Using access software for these kinds of data files is required.

Historically, data files created in the earlier days of computing required little presentation processing, but as computers, software, data, and algorithms have continually increased in complexity over time, the amount of required presentation processing has also increased.

The second characteristic is that data files have a prescribed form and format. The above examples reference several data file formats, including Hyper-Text Markup Language (HTML) and Microsoft Word's native binary (DOC). This prescribed form and format is specified by a “data file type template.” As used herein, the term “data file type template” means a set of specifications about a data type that governs its format and behaviors.

The “specifications” in the above definition are essentially theinstructions required by the access software to perform presentationprocessing.

Data files are often aggregated to facilitate management and presentation processing. In the web page example (Example 6), the web page is composed of many individual data files, which is known as a “data file set.” The term “data file set” means one or more data files that are logically related for purposes of presentation processing by access software.

Data file sets can either be “explicit,” or “implicit.” “Explicit” data file sets are defined by information contained in the data files, whereas “implicit” data file sets are defined through inscrutable software processing algorithms. Consider these examples:

EXAMPLE 8

Consider again the example of a web page. When an HTML data file refersto a CSS style sheet data file, it does so explicitly by data file name.This name can be resolved to find the CSS data file.
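
The explicit reference in Example 8 can be sketched in code. The following is a minimal illustration only, assuming Python's standard html.parser module; the names ReferenceCollector and explicit_data_file_set and the sample file names are hypothetical and are not part of the described system.

```python
from html.parser import HTMLParser


class ReferenceCollector(HTMLParser):
    """Collects the file names that an HTML data file refers to explicitly."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Style sheets are referenced through <link href="...">.
        if tag == "link" and attributes.get("href"):
            self.references.append(attributes["href"])
        # Scripts and images are referenced through a src attribute.
        elif tag in ("script", "img") and attributes.get("src"):
            self.references.append(attributes["src"])


def explicit_data_file_set(html_name, html_text):
    """Return the HTML data file name plus every data file it names."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    return [html_name] + collector.references


page = """<html><head>
<link rel="stylesheet" href="style.css">
<script src="menu.js"></script>
</head><body><img src="logo.png"></body></html>"""

print(explicit_data_file_set("index.html", page))
# prints ['index.html', 'style.css', 'menu.js', 'logo.png']
```

Because every member of the set is named inside the HTML data file, the set can be recovered without running the access software, which is what distinguishes it from the implicit case in Example 9.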

EXAMPLE 9

Consider an example of a set of database tables that include multiple data files for different kinds of information. One data file might contain simple data, another might contain binary data, and yet another data file might contain index information. The relationship between these data files is implicit, meaning it is not specified within the data files. Only the database application software defines these relationships as part of its presentation processing.

FIG. 5 illustrates the relationship between data files, data file typetemplates, data file sets, and access software.

As discussed above, electronic records are conceptual and data files arephysical. Electronic records are manifested in some way as electronicdata files, but the manner in which the electronic records aremanifested must first be determined.

First, the options to describe the relationship between electronicrecords and data files should be considered. An individual record may becomposed of:

-   One entire data file
-   Multiple entire data files
-   A portion of one data file
-   Portions of multiple data files

All of these options may apply, as explained in the following examples,which extend the example of the parish church (Example 3).

EXAMPLE 10

The parish church maintains each baptismal record as a separate wordprocessing document data file, and its financial ledger as a separatespreadsheet data file. In this case, there is a one-to-onecorrespondence between a record and each data file.

EXAMPLE 11

The parish church maintains two separate spreadsheet data files for itsfinancial ledger record, one spreadsheet for the balance statement and asecond spreadsheet for the profit/loss statement. In this case, onerecord is composed of multiple data files.

EXAMPLE 12

The parish church has a sophisticated content management softwareapplication to manage all of its documents. The content managementapplication stores all documents (including baptismal records,correspondence, financial ledgers, etc.) in one single database datafile. In this case, one record is composed of a portion of one datafile.

EXAMPLE 13

Again, the parish church has a sophisticated content management softwareapplication to manage all of its documents. The content managementapplication stores all documents in one single database data file andall metadata about the documents in a separate database data file. Inthis case, one record is composed of portions of multiple data files.

In Examples 10-13, the intellectual form, content, and number ofelectronic records remains fixed, while the relationship of thoseelectronic records to data files varies, depending on the particulars ofhow the parish church manages and uses its data files at a specificpoint in time.

The reason that the relationship between a record and data files varies is that a record has strong semantic coherence, while data files may not. A particular data file might contain many different kinds of information, or even bits and pieces of information, which sometimes is not eye readable without significant presentation processing and access software. In other words, semantic coherence is not a requirement for data files per se; the semantic coherence is realized by the presentation processing and access software and the human understanding gained through using that software.

The relationship between electronic records and data files, then, ispotentially many-to-many at a portion level—a record might be composedof one or more portions of data files, and data files might contain oneor more portions of electronic records.

Based on Examples 10-13, it should be appreciated that the gap betweenelectronic records (conceptual view) and data files (physical view) mustbe bridged. As the InterPARES I Preservation Task Force concluded,“Digital data inscribed on a physical medium do not have the form of arecord. It is necessary to transform the inscribed bits into the form ofthe record.” (“Preserving Electronic Records,” Presentation on the workof the InterPARES I Preservation Task Force, Jun. 19, 2002)

The present invention provides a solution to the gap between electronic records and data files by adding a logical view which transforms between the conceptual and physical views. To perform this task, the present invention provides a “digital component extractor.” As used herein, the term “digital component extractor” is defined as a software component that extracts digital components from a data file set, guided by a set of instructions. A “digital component” is defined herein as a set of digital information that exhibits strong semantic coherence and is expressed as a bit stream.

The purpose of the digital component extractor is to extract digitalcomponents from data files in a data file set that together comprise arecord. FIG. 5 illustrates the model, which bridges the gap betweenelectronic records and data files.

One implication of this model is that electronic records are composed ofdigital components (which exhibit strong semantic coherence) and notdata files (which can exhibit any range of semantic coherence, includingnone whatsoever). Another implication is that digital componentextractors are instructed as to how to extract digital components fromdata file sets.

Digital component extractors establish the map between data files andelectronic records, and because this map is many-to-many, the exactmethod by which digital component extractors extract digital componentsvaries. Consider the following examples:

EXAMPLE 14

If there is a one-to-one correspondence between a record and a datafile, the digital component extractor simply needs to return thespecified data file as the digital component. For example, a digitalcomponent extractor for a record that corresponds to a single wordprocessing document data file would simply return that data file as thedigital component.

EXAMPLE 15

If a record is composed of portions from one data file, the digitalcomponent extractor includes an algorithm to extract portions of thespecified data file. For example, a digital component extractor for arecord that corresponds to an e-mail archive data file would extractindividual e-mails as digital components.

EXAMPLE 16

If a record is composed of portions from more than one data file, thedigital component extractor includes an algorithm to extract portions ofthe specified data files. For example, a digital component extractor fora record that corresponds to a document spread across multiple databasetables (and data files) in a content management software applicationwould perform appropriate queries on those database tables to extractthe digital component.

Put another way, digital component extractors contain the instructionsnecessary to extract digital components from data file sets.
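
The three cases in Examples 14-16 can be sketched as follows. This is a minimal sketch under stated assumptions, not the extractor of the described system: a data file set is modeled as a dictionary from file name to bytes, a NUL byte stands in for the boundary between e-mails, and database tables are modeled as lines of the form key:value. All function and file names are hypothetical.

```python
def whole_file_extractor(data_files, name):
    """Example 14: the record corresponds to one entire data file."""
    return [data_files[name]]


def portion_extractor(data_files, name, separator):
    """Example 15: the digital components are portions of one data file."""
    parts = data_files[name].split(separator)
    return [part for part in parts if part]


def multi_file_extractor(data_files, names, key):
    """Example 16: one component assembled from portions of several files.

    Each data file is assumed to hold lines of the form b"key:value".
    """
    portions = []
    for name in names:
        for line in data_files[name].splitlines():
            line_key, _, value = line.partition(b":")
            if line_key == key:
                portions.append(value)
    return [b"\n".join(portions)]


data_files = {
    "letter.doc": b"Dear parishioner ...",
    "archive.mail": b"first e-mail\x00second e-mail\x00",
    "documents.db": b"42:body of document 42\n43:body of document 43",
    "metadata.db": b"42:title of document 42\n43:title of document 43",
}

print(whole_file_extractor(data_files, "letter.doc"))
print(portion_extractor(data_files, "archive.mail", b"\x00"))
print(multi_file_extractor(data_files, ["documents.db", "metadata.db"], b"42"))
```

In each case the extractor returns a list of bit streams, so callers can treat the one-to-one, one-to-many, and many-to-many mappings uniformly.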

Table 2 documents the approaches for specifying digital componentextractors, and their advantages and disadvantages.

TABLE 2

| Approach | Advantages | Disadvantages |
| --- | --- | --- |
| The transferring entity defines the digital component extractors early in the records lifecycle, as the records are still in active use | The transferring entity defines semantic coherence early, which ensures that the information contained in the data files is accessible | Requires up-front planning and investment by the transferring entity, plus a change in how the transferring entity manages information |
| The transferring entity (with assistance from the archivist) defines the digital component extractors after-the-fact, as part of preparing to transfer the electronic records to ERA | The transferring entity (with assistance from the archivist) generally has the subject area domain knowledge and technical knowledge to properly define semantic coherence | Requires a large time and resource investment at the exact point (records management offices) at which transferring entities are overburdened |
| The ERA system itself imputes digital component extractors from record type templates and data type templates | The system can make reasonable assumptions about the digital component extractors in an automated manner | A human might make better assumptions than the automated ones, based on subjective judgment. Also, the system might not always be able to perform this imputation (for example, if key information is missing) |
| An archivist defines the digital component extractors after-the-fact, during archival processing | The archivist generally has the subject area domain knowledge and technical knowledge to properly define semantic coherence | Requires a large time and resource investment from the archivist, which may not scale to meet the electronic record archive's expected ingest volumes |
| The electronic record archive system itself imputes semantic coherence and therefore digital component extractors from the data file content | The system can apply linguistic and pattern matching algorithms to determine appropriate digital component extractors in an automated manner | This is an area of on-going computer science research, and at this time this requires further development. |

It would be efficient for transferring entities to establishintellectual control over the semantic coherence of their electronicrecords as they develop their information systems, but this will notalways happen. It would also be efficient if transferring entities, withassistance from the archivist, at least defined their electronic recordsbefore the point of transfer, but again this will not always happen,because this is a burden on records officers. The system of the presentinvention imputes digital component extractors from templates asdiscussed below, and this generally will be acceptable. In the caseswhere none of these approaches work, the ERA must allow archivists toestablish intellectual control over the electronic records at an itemlevel through defining the digital component extractors.

Generally, ERA imputing the digital component extractors from therelevant templates will work quite well. Consider this example:

EXAMPLE 17

The record type template indicates a particular set of records iscorrespondence, and the data file template indicates the data file is inMicrosoft Outlook (PST) format. A reasonable set of digital componentextractors can be imputed that extract individual e-mails into separatedigital components. Each digital component represents an individuale-mail, which exhibits strong semantic coherence.

In some rare cases, there may be no workable digital componentextractors, because they are not defined by either the transferringentity or archivist, and the ERA system cannot impute reasonablealternatives. Consider this example:

EXAMPLE 18

The record type template indicates a particular set of records isgeospatial information, and the data file template is in an unknownproprietary format that is not human readable and not documented. ERAcannot impute a reasonable set of digital component extractors becauseit is not aware of the data type format.

In the case where there are no workable digital component extractors, the ERA of the present invention will create a default set of digital component extractors, known as “placeholder digital component extractors,” which are defined as a set of digital component extractors that assume each data file is a single digital component.

The levels of available preservation, access, and authenticity services that the ERA of the present invention can provide may be constrained for electronic records with placeholder digital component extractors, so these should be the exception rather than the norm. In other words, placeholder digital component extractors are only consistent with the most basic level of service in ERA.
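
The imputation described in Examples 17 and 18, together with the placeholder fallback, can be sketched as a lookup keyed by record type and data file type. This is a hedged illustration: the table KNOWN_EXTRACTORS, the function names, and the NUL-separated stand-in for an e-mail archive are all assumptions, and no real PST parsing is performed.

```python
def split_messages(data_file):
    """Hypothetical extractor: one digital component per e-mail."""
    return [part for part in data_file.split(b"\x00") if part]


def placeholder_extractor(data_file):
    """Placeholder: the whole data file is treated as one digital component."""
    return [data_file]


# Hypothetical lookup keyed by (record type, data file type).
KNOWN_EXTRACTORS = {
    ("correspondence", "PST"): split_messages,
}


def impute_extractor(record_type, data_file_type):
    """Return a known extractor, or the placeholder when none can be imputed."""
    return KNOWN_EXTRACTORS.get((record_type, data_file_type), placeholder_extractor)


mail = b"first e-mail\x00second e-mail"
print(impute_extractor("correspondence", "PST")(mail))
# prints [b'first e-mail', b'second e-mail']
print(impute_extractor("geospatial", "unknown")(mail))
# prints [b'first e-mail\x00second e-mail']
```

The fallback keeps every data file retrievable while making clear that no semantic coherence has been established for it.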

All of the entities modeled by the present invention, such as electronicrecords, record aggregates, digital components, data files, etc., mustbe identifiable and resolvable. An approach to identifiers is more fullydocumented in co-pending, commonly assigned U.S. Application (AttorneyDocket 4870-9), filed Apr. 26, 2007, entitled SYSTEM AND METHOD FOR ANIMMUTABLE IDENTIFICATION SCHEME IN A LARGE SCALE COMPUTER SYSTEM.

All identifiers within the ERA must exhibit the following characteristics:

-   The identifier must resolve to the entity which it identifies
-   The identifier must be guaranteed unique across the ERA identifier namespace
-   The identifier for a particular entity must be immutable
-   The identifier system must scale to ten teraobjects

An approach to generating identifiers according to the present inventioninvolves using a cryptographic hash algorithm (such as SHA-256) based onthe initial content of the thing being identified. This approach meetsthe required constraints.

It should be noted that some entities have an identity which is independent of their content. For example, the identity of a record is independent of the content of the digital components and/or data files that make up any particular version of that record. New versions of electronic records can arise from redaction and preservation activities, and each record version will have its own independent identifier that is related back to the record.

In these cases, the identifier will be generated from the content of the entity when it is first created within ERA and will be immutable thereafter. Thus, the identifier for electronic records would be generated and assigned when the record is created within ERA based on the content of the first version's digital components, and that identifier would be immutable thereafter.
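
The identifier scheme described above can be sketched with Python's standard hashlib module. This is a minimal sketch, not the scheme of the co-pending application: the Record class, the length-prefixed hashing of components, and the "record"/"version" prefixes (used here so that a record and its first version do not receive the same identifier) are all assumptions made for illustration.

```python
import hashlib


def content_identifier(kind, components):
    """SHA-256 over an entity kind and a list of component bit streams."""
    hasher = hashlib.sha256()
    hasher.update(kind.encode("utf-8"))
    for component in components:
        # Length prefix so that component boundaries affect the digest.
        hasher.update(len(component).to_bytes(8, "big"))
        hasher.update(component)
    return hasher.hexdigest()


class Record:
    def __init__(self, first_version):
        # Generated once from the first version's content, never reassigned.
        self.identifier = content_identifier("record", first_version)
        self.versions = {}
        self.add_version(first_version)

    def add_version(self, components):
        """Register a version and return its own identifier."""
        version_id = content_identifier("version", components)
        self.versions[version_id] = list(components)
        return version_id


record = Record([b"baptismal entry for J. Smith, 1901"])
original_id = record.identifier
redacted_id = record.add_version([b"baptismal entry for [name withheld], 1901"])

print(record.identifier == original_id)  # prints True
print(redacted_id != original_id)        # prints True
```

Adding a redacted version yields a new version identifier while the record identifier stays fixed, which mirrors the immutability requirement listed above.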

An approach to preservation and authenticity issues is more fully documented in co-pending, commonly assigned U.S. application (Attorney Docket 4870-25), entitled SYSTEM AND METHOD FOR PRESERVATION OF DIGITAL RECORDS.

The notion of digital components and digital component extractors has some interesting implications for preservation. The InterPARES I Preservation Task Force states “It is impossible to preserve an electronic record. It is only possible to preserve the ability to reproduce an electronic record.” (“Preserving Electronic Records”, Presentation on the work of the InterPARES I Preservation Task Force, Jun. 19, 2002.) A record's digital components, along with access software, allow reproduction of the electronic record. As such, the preservation strategy of the present invention ensures the digital component extractors produce digital components that authentically represent the record. This means that digital component extractors must honor the essential characteristics associated with the record (and which are specified in the record type template).

The process of redaction involves deleting specific content from arecord to produce a new version of the record, and the new version ofthe record typically has reduced access restrictions.

In the electronic record context, digital content is contained in both data files and digital components, so in theory redaction (deleting digital content) could occur in either place. In practice, most redaction tools redact content from data files, so the present invention will support this approach. This means that redaction will occur against data files, which will produce a new version of the data files, and the digital component extractors will produce new digital components from these redacted data files. This process will result in a new version of the record that is composed of redacted digital components that have been extracted from redacted data files.
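
The redaction flow just described (redact the data file, then re-run the extractor) can be sketched as follows. The functions redact and extract_lines and the sample content are hypothetical; a real redaction tool would operate on the data file's native format.

```python
def redact(data_file, sensitive):
    """Return a new data file version with the sensitive content deleted."""
    return data_file.replace(sensitive, b"")


def extract_lines(data_file):
    """Hypothetical extractor: one digital component per line."""
    return data_file.splitlines()


original_file = b"Name: J. Smith\nSSN: 123-45-6789"
redacted_file = redact(original_file, b"123-45-6789")

# The same extractor runs against each data file version.
version_1 = extract_lines(original_file)
version_2 = extract_lines(redacted_file)

print(version_1)  # prints [b'Name: J. Smith', b'SSN: 123-45-6789']
print(version_2)  # prints [b'Name: J. Smith', b'SSN: ']
```

The original data file and its components are left untouched, so both record versions remain reproducible.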

Like records, original order and arrangement are conceptual and notphysical. Thus, order and arrangement both apply to records, but notdata files. The order of data files is essentially arbitrary andmeaningless from an archival context, since data files exhibit lowsemantic cohesion.

It is possible that electronic records might have no meaningful originalorder, in the same way paper records might have no meaningful originalorder. In these cases, the present invention will follow the advice ofFrank Boles in “Disrespecting Original Order” to maintain records in astate of simple usability. (Boles, F., “Disrespecting Original Order”,The American Archivist, Vol. 45 No. 1, pp. 26-32, 1982.) Simpleusability for electronic records implies dynamic sorting, filtering, andquerying capabilities.

It is possible that the digital component extractors of the present invention will be executed to produce a physical representation of a digital component. In this case, a digital component would be a bit stream serialized as a managed file within the system. It is also possible that the digital component extractors will be executed on-demand to produce a transient digital component, as needed. In this case, a digital component would be a transient in-memory bit stream. The present invention allows for both options, and the decision on which to use will be a matter of policy and design.

Templates play a large part in NARA's vision of the ERA both as a meansto manage electronic records, in respect to scheduling, and as a meansto preserve records, in respect to defining preservation formats andprocessing.

Because there are many potential applications of templates, and becausetemplates are sometimes described by examples of documents that conformto the templates rather than the template itself, there is a need todefine what templates are and how they are used.

As discussed in more detail below, the present invention utilizes ataxonomy of templates and the relationships between templates andinstances of templates to identify and manage records. The presentinvention also utilizes the relationship between hierarchical templatesand hierarchical information using a matrix. Furthermore, the presentinvention provides for managing templates.

It is helpful to begin with an example of templates and instances oftemplates, and to provide an illustrative listing of some kinds oftemplates that might be used within the ERA system of the presentinvention.

According to the present invention, the use of templates may be associated with all of the following:

-   To describe the structure and content of record life cycle documents that the system will help create and manage. This includes templates for Transfer Agreements, Disposition Agreements, Preservation Plans, etc.
-   To describe the presentation of documents.
-   To define the relationship between assets within the archive (such as the original order of records) and within transfers of records to the archive.
-   To describe the structure and content of archival metadata, the contextual information which, together with the digital objects it describes, forms the records. This includes archival description elements and life cycle data elements.
-   To describe components and resources within the system itself. Instances of these templates include data type format templates, templates that describe digital adaptation processes, and resources such as Authorities Sources.
-   To describe the operation of the ERA system itself. Instances of these templates define operations such as work flow processes that orchestrate the use of ERA system services.

It can therefore be seen that templates are being used according to thepresent invention to:

-   Describe the content and structure of a document: what data elements it should contain and any relationships between those data elements.
-   Describe the content and structure of the metadata that describes a document.
-   Describe how a document should be presented to a user, how its content would be laid out on a screen or a printed page, and, when appropriate, the choreography of the presentation of different digital objects.
-   Serve as a manifest to list all the documents contained within some collection of documents.
-   Serve as a catalog of documents describing the relationships between them.
-   Serve as components within the ERA system, providing processing instructions for operations that take place, such as the orchestration of work flows or digital adaptation processing.
-   Describe components of the ERA system, such as specific data type formats.

Some of these uses of templates have been described with reference toinstantiations of the templates and some have been described withreference to the templates themselves. It is necessary to distinguishbetween templates and instances of templates.

Using XML technologies as an example, the following describes templates, and instances of documents that conform to or are generated by those templates, that might be used in the preservation and presentation of a document displayed on a web page.

The first template is an XML schema that defines the structure of therecord catalog which lists the digital objects that are part of the webpage and their hierarchical relationships. An instance of that templateis a selection from the record catalog for the page in question.

Referring to FIG. 6, the next template might be an XML schema thatdefines the content and structure of the document that is to bedisplayed on the page. Each data element in the document is defined. Therelationship(s) of each data element to other data elements are alsodefined.

Referring to FIG. 7, an instance of the template of FIG. 6 is an XMLdocument (the textual content of the document) that conforms to thatschema and which includes the data elements and content of the typedefined in the schema. The instance has data elements described in theschema that hold values, which is also consistent with the schema.

Referring to FIG. 8, the next template might be an XSL template that defines the presentation of that XML instance in HTML on the web page (or in some other format such as PDF). The XSL template may be a style sheet, or other type of template, and can be used to describe how an XML instance that conforms to an XML schema will be presented or displayed, for example as HTML or a PDF file. The template can also be used to transform an XML document into a variety of other formats, as well as into a different XML document.

Other types of templates may orchestrate a sequence of pages. The instantiation of that template is the web page, which is the record that is being preserved.

Additional templates may be involved in defining the behavior of a webapplication, including templates that define the work flow within theapplication, templates that define the orchestration of pages within theapplication and templates that describe the animation of items on apage.

Table 3 provides an overview of some of the types of templates that may occur in the ERA of the present invention. Although each example has been mapped to an appropriate XML syntax that might be used to create the template, it should be appreciated that the present invention is not limited to the use of any particular format. It should also be appreciated that the list of templates in Table 3 is not intended to be exhaustive. There are many possible applications for templates and there are other XML technologies, and non-XML technologies, which may be used.

TABLE 3

| Application of Template | Indicative XML Syntax | Examples |
| --- | --- | --- |
| 1. Record Structure Templates | | |
| Structure of Records; Record Catalog entries | XML Schema, METS | Record Catalog, Submission Information Package |
| 2. Lifecycle Documents | | |
| Structure and content of Life Cycle documents | XML Schema | Transfer Agreement, Disposition Agreement, Preservation Plan |
| Layout of documents on screen or paper | XSL, XSL-FO | Presentation of documents |
| 3. Archival Metadata (information specific to a record or a part of a record) | | |
| Structure and content of Archival Description | XML Schema | Origin, Provenance, Content, Context, etc. |
| Structure and content of Life cycle Data | XML Schema | Additions to life cycle data |
| 4. System Components (an information component of the system, or description of a component of the system) | | |
| Structure of Authority Sources and Thesauri | XML Schema | Authority Sources |
| Structure and content of Persistent Object Formats (POF) *(1) | XML Schema | Persistent Formats where content is primarily words, numbers, vectors etc. |
| | BSDL | Persistent Formats where content is primarily images, sound, etc. |
| Digital Adaptation templates, non-exhaustive list *(2) | XSL/T | Data type specific processing instructions to transform from one data type to another |
| Presentation of multimedia records | SMIL | Templates to define interactions between multiple digital items in multimedia presentations |
| 5. System Metadata | | |
| Description and versioning of templates | XML Schema | Disposition Agreement template |
| 6. Identity & Rights | | |
| Structure and content of User Profiles | XML Schema | User profiles |
| Authorization Requests/Responses | SAML | Authorization of users |
| Access Restrictions & Rights | XACML | Definition of access privileges for specific records |
| 7. Service Architecture | | |
| Work flow Processes | BPEL | Orchestration of services involved in business processes, such as managing a FOIA request |
| Services | WSDL | Inputs and outputs of individual services |

Templates may be used to define the relationships between records in thearchives, such as defining the original order of records, the structureof the record catalog, and the structure of transfers to the archives orthe delivery of copies to users (Submission Information Packages andDissemination Information Packages).

Capturing the original order of a record represents a case where a template can be used within a template. The structure of the Record Catalog can be described in a template that defines the information elements that make up an entry in the catalog. The content of some of those information elements may be other templates, or they may become values in the instantiation of an object that conforms to another template.

Templates may be used to define the content and structure of recordsschedules and other Life Cycle Documents.

Templates may be used to define the structure of record description, andthe elements of information that compose the metadata of records.

A template for Archival Metadata, which includes description and Life cycle data, will define which elements of information must be present, what type of information they should contain, and how they are related to each other.

Templates may be used as inputs to processes that transform digitalobjects in the archive, including templates that may be used to definethe presentation of assets to users.

The System component templates cover the widest variety of use oftemplates. This includes defining persistent object formats, definingthe information needed by a processor to render those formats in acurrent format, defining the choreography and behaviors of objects inaggregate multimedia records, etc.

The System Components will be constantly evolving, adding new templatesas new digital technologies evolve. Each type of system component willhave its own family of templates.

Templates may be used to define the structure of component description.The ERA system will archive itself and be self-describing. Templateswill define elements of information needed for components to be selfdescribing.

Templates may also be used to define the nature and rights of entitiesand the access restrictions on assets in the archive.

A records-centric access model will define restrictions and rights inrelation to records using the internal structure of the recordsthemselves. Templates will define the instructions on records and createthe framework for aligning identity—role—authorization to protect therecords.

Templates may further be used to describe system services andorchestrate services within work flow processes.

The Service Architecture describes the arrangement and delivery ofservices in the ERA system of the present invention, including the workflow processes and the functionality at each step in the process.Templates, expressed for example in Business Process Execution Language(BPEL), may be used to describe the orchestration of functionalservices, and at a lower level, describe the inputs and outputs to eachindividual functional services, using for example Web ServicesDescription Language (WSDL).

A hierarchical scheme according to the present invention may beimplemented for managing templates. The introduction of hierarchy to themanagement of templates adds another level of abstraction. A templateabstracts from a specific instance to the general case. Such a templateis associated to a single type of object. With hierarchy, another layerof abstraction may be added that can be applied to any of: 1) thetemplate, 2) the content which it controls, or 3) both.

As an object subject to a hierarchical arrangement, the template becomes a mirror of the organization of objects into increasingly larger aggregate structures, which is a method of organization common to the ERA system of the present invention as a whole.

Templates can have a hierarchical connotation either because: (a) thetemplate itself can only be instantiated with reference to a hierarchyof templates which collectively define its content, or (b) the objectthe template describes can only be instantiated with reference to ahierarchy of digital items or conceptual arrangements of digital items.

In the first case (a), instantiating the template requires retrievingelements from within different templates within a hierarchy. Forexample, Life Cycle Data document templates (Transfer Agreements,Disposition Agreements, etc) will have their own specific informationelements but will also likely share a set of information elements commonto all Life Cycle Data documents.

The template hierarchy might look like:

ERA.xsd (elements common to the ERA, such as identifiers)

-   Life_Cycle_Documents.xsd (elements common to all Life Cycle documents)
    -   Transfer_Agreement.xsd (e.g. SF-258 specific elements)
    -   Disposition_Agreement.xsd (e.g. SF-115 specific elements)
    -   Preservation_Plan.xsd (elements specific to this template)

In XML Schema, this may be implemented by having each template in eachchild level of the template hierarchy begin with an <include/>instruction that incorporates in the child template all the dataelements described in its parent, which in turn will <include/> all thedata elements in its parent, etc.
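
The effect of the chained include instructions can be sketched by resolving a template against its ancestors. The sketch below models the template hierarchy as a plain dictionary rather than parsing XML Schema files; the element names (identifier, title, status, transfer_date) and the function resolve_elements are hypothetical.

```python
# Hypothetical template store: name -> (parent name or None, own elements).
TEMPLATES = {
    "ERA.xsd": (None, ["identifier"]),
    "Life_Cycle_Documents.xsd": ("ERA.xsd", ["title", "status"]),
    "Transfer_Agreement.xsd": ("Life_Cycle_Documents.xsd", ["transfer_date"]),
}


def resolve_elements(name, templates=TEMPLATES):
    """Collect a template's data elements, inherited elements first."""
    parent, own = templates[name]
    # Each level pulls in its parent, which in turn pulls in its own parent.
    inherited = resolve_elements(parent, templates) if parent else []
    return inherited + own


print(resolve_elements("Transfer_Agreement.xsd"))
# prints ['identifier', 'title', 'status', 'transfer_date']
```

A document can only be checked for conformance against the fully resolved element list, which is why the template cannot be instantiated without reference to the hierarchy.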

In the second case (b), to instantiate a document that conforms to atemplate requires retrieving elements of information from hierarchicallyorganized assets within the archive.

For example, the template for archival metadata may include elements of information, some of which are associated with a record catalog item that represents the concept of the entire record (the parent or root element of the record), while other elements of information are associated with individual digital items that are components of the record.

To create a document that represents the archival metadata for a specific digital item, and which conforms to the archival metadata template, requires retrieving all the information elements from each level in the record's internal hierarchy from that digital item up to the record's “root”.

For example, suppose that the family of a noted physicist donates her personal papers to NARA. The record hierarchy might look like:

Curie Collection
  Family Papers
  Professional Papers
    Research Activities
      Reagents

Metadata that describes the <Origin> of the record will likely be associated with the highest level in the record hierarchy, the “//Curie Collection” level, as the description of <Origin> applies to all the documents in that collection.

Metadata that describes the <Digital Object Type> of a specific document will be associated with a specific document, such as “//Curie Collection/Professional Papers/Research Activities/Reagents”.

To create an instance of the metadata for the “//Reagents” document requires the accretion of the metadata for itself and all its ancestors as we traverse the record hierarchy up to the collection level.
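
By way of non-limiting illustration, the accretion of metadata along the record hierarchy may be sketched as follows. The metadata values, the slash-separated path form, the function name accrete_metadata, and the rule that a value held nearer the digital item takes precedence over an ancestor's value are all assumptions made only for this example.

```python
from typing import Dict

# Hypothetical metadata attached at individual levels of the record hierarchy.
# Keys are slash-separated paths from the collection root; values are placeholders.
LEVEL_METADATA: Dict[str, Dict[str, str]] = {
    "Curie Collection": {"Origin": "family donation"},
    "Curie Collection/Professional Papers/Research Activities/Reagents": {
        "DigitalObjectType": "document",
    },
}


def accrete_metadata(
    path: str, level_metadata: Dict[str, Dict[str, str]] = LEVEL_METADATA
) -> Dict[str, str]:
    """Merge metadata from the item at `path` and from each of its ancestors."""
    parts = path.split("/")
    merged: Dict[str, str] = {}
    # Walk from the digital item up to the root of the record hierarchy.
    for depth in range(len(parts), 0, -1):
        level = "/".join(parts[:depth])
        for key, value in level_metadata.get(level, {}).items():
            # Assumption: a value already found nearer the item is kept.
            merged.setdefault(key, value)
    return merged


if __name__ == "__main__":
    reagents = "Curie Collection/Professional Papers/Research Activities/Reagents"
    print(accrete_metadata(reagents))
```

In this sketch, the metadata instance for the Reagents document is complete only after the traversal reaches the collection level, at which point it carries both its own <Digital Object Type> and the inherited <Origin>.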

The possible intersections of templates and hierarchies can be presented in a matrix as shown in Table 4. Along one axis are the templates, either derived from a hierarchy or self-contained. Along the other axis are the conforming content, again either derived from a hierarchy or self-contained.

Table 4 below illustrates where some types of templates may fall in the matrix.

TABLE 4

| Template Axis | Content Self-Contained: An object that conforms to the template is a self-contained object in its own right and conformance can be tested without reference to the hierarchy to which it belongs. | Content Hierarchical: The creation of an object that conforms to the template is achieved by retrieving all references to it from each layer in the hierarchy. The conforming object accretes its content as it traverses the hierarchical tree and is only conforming at the end of the accretion process. |
| --- | --- | --- |
| Template is Hierarchical: The template is an aggregation of elements from a hierarchy of templates. Document conformance cannot be tested without including elements from the hierarchy. | Life Cycle Document templates, where template is Life Cycle Document + generic Life Cycle template elements. | Archival metadata; the schema for metadata may be instantiated by aggregating schemas within a hierarchy of metadata schemas, and the conforming metadata document may be created from the aggregation of all metadata elements traversing a record hierarchy. |
| Template is Self-Contained: The template is a self-contained object. Document conformance can be tested without reference to any other template. | System metadata, such as persistent format definitions. Service Architecture templates; both the hierarchy of BPEL managing WSDL, and within WSDL the aggregation of generic WSDL and the web service specific elements described in XML Schema. | n/a |

In a self-describing system, each template is both a functional component of the system and a record in the system. As a record in the system, the template is treated the same as any other record, with its own metadata, life cycle management, and preservation. The ERA system of the present invention may be regarded, therefore, as an aggregate record, with its own hierarchy of documents, so that part of our ERA record hierarchy might look like:

ERA
  System
    Templates
      System
        Workflow
          DispositionWorkflow.bpel (instance of BPEL template)
            AddDescriptionService.wsdl (instance of WSDL template)

Each instance of a system component, including templates, has its own archival metadata (metadata that describes a record). This latter metadata makes the component self-describing.

For example, a WSDL file is an instance of the template for defining a service and a BPEL file is an instance of the template that defines a work flow.

The archival metadata of the WSDL file will include information such as:

- What does it do?
- What work flow does it belong to?
- What version is this, is it the current version?
- How does it work: inputs, outputs?
- Where did the code originate?
- Are there intellectual rights associated with this web service?
- What is the actual code?

This sort of information could be included in the WSDL file as comments (or <Documentation/> elements) but would not be very manageable as a result. The system would not be able to apply its record management functionality, which is based on archival metadata held exterior to the digital object the metadata describes, to its own templates.

To make description of the system components manageable, they should be described using the same archival metadata templates as for any record.

While there will be a defined template for a service in the ERA (such as the XML Schema for WSDL), the present invention may use another template, the Archival Metadata schema, as the template to describe the service as a component of the system.

As templates evolve, the life cycle data elements in their description capture that evolution, such as the version. When a change to a template changes the behavior of the system, the earlier version of the template is preserved as a record so that the previous behavior of the system can be understood.

Templates will evolve as ERA evolves. As such, templates, as records in ERA, will be versioned and managed. Life cycle data elements or records will include the version of the templates they use. Versioning will allow new templates to be introduced without creating problems with validation. Whether life cycle content that is subject to validation against templates should be updated as templates evolve will be a policy decision applied to each template.
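
By way of non-limiting illustration, validation against the template version recorded in each life cycle document may be sketched as follows. The registry layout, the required element names (Identifier, Title, Custodian), and the function name validate are hypothetical and are used only for this example.

```python
from typing import Dict, Set, Tuple

# Hypothetical registry: (template name, version) -> required element names.
# Earlier versions stay in the registry so that older documents remain valid.
TEMPLATE_VERSIONS: Dict[Tuple[str, int], Set[str]] = {
    ("Transfer_Agreement", 1): {"Identifier", "Title"},
    ("Transfer_Agreement", 2): {"Identifier", "Title", "Custodian"},
}


def validate(
    document: dict, registry: Dict[Tuple[str, int], Set[str]] = TEMPLATE_VERSIONS
) -> bool:
    """Check a document against the template version the document itself records."""
    key = (document["template"], document["template_version"])
    required = registry[key]
    return required.issubset(document["elements"].keys())


if __name__ == "__main__":
    older = {
        "template": "Transfer_Agreement",
        "template_version": 1,
        "elements": {"Identifier": "TA-0001", "Title": "Sample transfer"},
    }
    print(validate(older))  # validated against version 1, not the newest version
```

Because each document names the version it was created under, introducing version 2 in this sketch does not invalidate documents created under version 1; whether such documents are later migrated remains a policy decision.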

Each process to update a template may be a standard work flow in the ERA, and described in its own template, which will include appropriate approval and authorization steps as determined in policy.

Templates, as records, will have their own fixity information to ensure their integrity, and the life cycle data of objects modified by templates will record which version of which template was used.
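
By way of non-limiting illustration, fixity information for a template and the recording of the template version in life cycle data may be sketched as follows. The use of a SHA-256 digest is an assumption (no particular fixity algorithm is specified herein), and the function and field names are hypothetical.

```python
import hashlib


def fixity_digest(content: bytes) -> str:
    """Compute a digest of the template content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def register_template(name: str, version: int, content: bytes) -> dict:
    """Create a record for a template version, including its fixity value."""
    return {"name": name, "version": version, "sha256": fixity_digest(content)}


def verify_template(record: dict, content: bytes) -> bool:
    """Return True only if the content still matches the recorded fixity value."""
    return fixity_digest(content) == record["sha256"]


def stamp_life_cycle_data(life_cycle_data: dict, template_record: dict) -> dict:
    """Return a copy of the life cycle data noting which template version was used."""
    stamped = dict(life_cycle_data)
    stamped["template_used"] = {
        "name": template_record["name"],
        "version": template_record["version"],
    }
    return stamped


if __name__ == "__main__":
    content = b"<schema>placeholder template body</schema>"
    record = register_template("Transfer_Agreement.xsd", 2, content)
    print(verify_template(record, content))
    print(stamp_life_cycle_data({"object": "example-object"}, record))
```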

The concept of managing templates can be extended to apply to every component of the system. Each software component of the ERA system should be described and held in the ERA. This applies to platform applications, web application components, any client side components, as well as all the functionality wrapped in web services which can be managed within the concept of managing templates as described above.

The concept of preserving original arrangement can also be extended to the system so as to describe in Archival Metadata how all the components are structurally linked, creating in essence a schema for the ERA itself.

While the invention has been described in connection with what are presently considered to be the most practical and preferred embodiments, it is to be understood that the invention is not to be limited to the disclosed embodiments, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the invention. Also, the various embodiments described above may be implemented in conjunction with other embodiments, e.g., aspects of one embodiment may be combined with aspects of another embodiment to realize yet other embodiments.

1. A method for managing electronic records, each electronic record comprising a data file, a plurality of data files, a portion of a data file, or portions of a plurality of data files, the electronic records comprising a plurality of record types and data file types, the method comprising: forming a data file set comprising one or more logically related data files; identifying attributes of each record type in a record type template; identifying specifications of each data file type in a data file type template; extracting digital components from the data file set, wherein the extracted digital components relate to the attributes in each record type template and the specifications in each data file type template and comprise an individual record.
2. A method according to claim 1, further comprising: specifying in each record type template characteristics of authenticity of each record type.
3. A method according to claim 1, wherein the data files of the data file set are logically related for purposes of accessing the extracted digital components.
4. A method according to claim 3, wherein accessing the extracted digital components comprises presenting the individual record in human understandable form.
5. A method according to claim 3, wherein accessing the individual record comprises transforming, consolidating, tabulating, formatting, rendering, querying, filtering, and/or interpreting the individual record.
6. A method according to claim 4, wherein presenting the individual record comprises presenting the record perceptible to human senses.
7. A method according to claim 1, wherein the data files of the data file set are logically related by a manner of presentation.
8. A method according to claim 3, wherein the specifications of each data file type comprise instructions for accessing the individual record.
9. A method according to claim 1, wherein the data files of the data file set are logically related by information contained in the data files.
10. A method according to claim 1, further comprising: extracting default digital components from the data file set when attributes of a record type and/or specifications of a data file type are unavailable.
11. An electronic record archive for managing electronic records, each electronic record comprising a data file, a plurality of data files, a portion of a data file, or portions of a plurality of data files, the electronic records comprising a plurality of record types and data file types, the electronic record archive comprising: a data file set comprising one or more logically related data files; a record type template for each record type, each record type template identifying attributes of each record type; a data file type template for each data file type, each data file type template identifying specifications of each data file type; and a digital component extractor configured to extract digital components from the data file set, wherein the extracted digital components relate to the attributes in each record type template and the specifications in each data file type template and comprise an individual record.
12. An electronic record archive according to claim 11, wherein each record type template specifies characteristics of authenticity of each record type.
13. An electronic record archive according to claim 11, wherein the data files of the data file set are logically related for purposes of accessing the extracted digital components.
14. An electronic record archive according to claim 13, further comprising an accessing component configured to present the individual record in human understandable form.
15. An electronic record archive according to claim 13, further comprising an accessing component configured to access the individual record by transformation, consolidation, tabulation, formation, rendition, questioning, filtering, and/or interpretation of the individual record.
16. An electronic record archive according to claim 14, wherein the accessing component is configured to present the individual record perceptible to human senses.
17. An electronic record archive according to claim 11, wherein the data files of the data file set are logically related by a manner of presentation.
18. An electronic record archive according to claim 13, wherein the specifications of each data file type comprise instructions for accessing the individual record.
19. An electronic record archive according to claim 11, wherein the data files of the data file set are logically related by information contained in the data files.
20. An electronic record archive according to claim 11, wherein the digital component extractor is configured to extract default digital components from the data file set when attributes of a record type and/or specifications of a data file type are unavailable.